Evaluation on the performance of the bank’s recent marketing campaigns

Presentation Video Link = https://binusianorg-my.sharepoint.com/personal/karina_wardoyo_binus_ac_id/_layouts/15/guestaccess.aspx?share=Eu9ZPmcecUNOl4Ch_qA2X9YBo3SyreCdMsLYkp0ZlNAnnw&e=CdiaCa

##   age        job  marital         education default housing loan   contact
## 0  83    retired divorced          basic.4y      no      no   no  cellular
## 1  32   services  married       high.school      no      no   no telephone
## 2  31     admin.   single university.degree      no     yes   no telephone
## 3  33     admin.   single university.degree      no     yes   no telephone
## 4  39   services  married       high.school      no     yes   no telephone
## 5  57 management  married university.degree      no      no   no  cellular
##   month day_of_week duration campaign pdays previous    poutcome   y
## 0   nov         tue      242        1     3        3     success yes
## 1   may         tue      190        3   999        0 nonexistent  no
## 2   may         wed      325        2   999        0 nonexistent  no
## 3   aug         thu      158        1   999        0 nonexistent  no
## 4   jul         mon      158        1   999        0 nonexistent  no
## 5   jul         tue      101        1   999        0 nonexistent  no

     Reading the CSV, also making the 1st column became the index for rows using row.names=1 but after that column ‘X’ will no longer show as column but became index, after that display the first 6 rows of data

Question 1: Data Quality Assessment and Preprocessing

1a. Identify data quality issues

Missing values

##         age         job     marital   education     default     housing 
##           0           0           0           0           0           0 
##        loan     contact       month day_of_week    duration    campaign 
##           0           0           0           0          12           0 
##       pdays    previous    poutcome           y 
##           0           0           0           0

     Based on what we checked there is only 1 column with missing values, that is the duration column with 12 missing values.

Duplicated values

## [1] 0

     Based on what we checked there is no duplicated data that exist in the dataset.

Unknown categories

## 
## Column: job 
## 
##                      admin.   blue-collar  entrepreneur     housemaid 
##             8          2033          1870           294           219 
##    management       retired self-employed      services       student 
##           546           366           263           822           184 
##    technician    unemployed       unknown 
##          1372           196            65 
## 
## Column: marital 
## 
## divorced  married   single  unknown 
##      892     4994     2338       14 
## 
## Column: education 
## 
##            basic.4y            basic.6y            basic.9y         high.school 
##                 817                 486                1243                1915 
##          illiterate professional.course   university.degree             unknown 
##                   6                1043                2379                 349 
## 
## Column: default 
## 
##      no unknown 
##    6491    1747 
## 
## Column: housing 
## 
##      no unknown     yes 
##    3749     205    4284 
## 
## Column: loan 
## 
##      no unknown     yes 
##    6741     205    1292 
## 
## Column: contact 
## 
##  cellular telephone 
##      5232      3006 
## 
## Column: month 
## 
##  apr  aug  dec  jul  jun  mar  may  nov  oct  sep 
##  529 1189   38 1481 1074  105 2729  827  149  117 
## 
## Column: day_of_week 
## 
##  fri  mon  thu  tue  wed 
## 1524 1676 1769 1599 1670 
## 
## Column: poutcome 
## 
##     failure nonexistent     success 
##         836        7136         266 
## 
## Column: y 
## 
##   no  yes 
## 7329  909

     Based on what we checked in categorical columns there is a few unknown categories and a few blank values such as ” ” in job column and “unknown” in job, marital, education, housing and loan columns.

Outliers in numerical columns

##         age         job     marital   education     default     housing 
##        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE 
##        loan     contact       month day_of_week    duration    campaign 
##       FALSE       FALSE       FALSE       FALSE        TRUE        TRUE 
##       pdays    previous    poutcome           y 
##        TRUE        TRUE       FALSE       FALSE

     Based on what we checked the numerical columns are age, duration, campaign, pdays, previous. So, we will only make the boxplots of these columns for checking outliers and extreme values, except the pdays because most of the datas is on 999 which means not previously contacted

     Based on the boxplot the biggest outlier is age 89 but technically it is still possible for someone to be 89 years old, so the outliers in age column is considered valid outliers.

     Based on the boxplot the biggest outlier is 971 seconds and above that is considered extreme values, the most extreme value is on 4918 seconds.

     Based on the boxplot the biggest outlier is 9 contacts made, and it is still not considered extreme, and data above 9 is considered extreme, with the most extreme is 43 contacts made

     Based on the boxplot the biggest extreme values is 6 contacts made, and no outliers were identified so anything above 0 means extreme values

Looking for irrelevant values

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##    4    4   13   87   29    7   97   11    1    7    7    1    4    7    4    4 
##   17   19   25   26  999 
##    1    1    1    1 7947

     The irrelevant values aren’t the exact data value, irrelevant values are sometimes considered as a value that symbolises something, in this case 999 in the column pdays means not previously contacted. The 999 can actually affect the mean and median of this column.

1b. Handle the issues you identified

Make a dataset called ‘clean_data’

     Make a duplicate of df and called clean_data the clean_data is the dataset that we will modify and make it cleaner and ready to get processed

Cleaning missing values using median

## Missing values in duration after imputation: 0
## Median used for imputation: 180 seconds

     Searching the median for duration column which is 180 and all missing values has been replaced with the median value which is 180, this makes there are no longer missing value in duration column. Also, because duration is a numerical column it is better to fill the missing value with median because it is more robust to outliers and extreme values.

Check again whether median input is successful or not

##         age         job     marital   education     default     housing 
##           0           0           0           0           0           0 
##        loan     contact       month day_of_week    duration    campaign 
##           0           0           0           0           0           0 
##       pdays    previous    poutcome           y 
##           0           0           0           0

     If we check again for the missing values in df but now is copied and modified to clean_data, the duration shows that it has 0 missing values. Also, from this result this indicates that none of the column has missing values.

Cleaning missing values or unknown values in categorical columns using mode

## 'Unknown' and blank values are replaced with most frequent values (mode) appeared in their respective category, the modes are like the following:
## Job: admin.
## Marital: married
## Education: university.degree
## Housing: yes
## Loan: no

     Also, for the categorical data, I chose to fill the blanks and unknowns with the mode of the columns, this shows that I am assuming the unknowns and the blanks has the same value as the majority values, but in here the column ‘default’ has the answer ‘unknown’ but since unknown here has meaning that is why I didn’t convert it to the mode of the default column, but just let it be.

Checking whether cleaning categorical columns by inputting mode works or not

## 
## Column: job 
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          2106          1870           294           219           546 
##       retired self-employed      services       student    technician 
##           366           263           822           184          1372 
##    unemployed 
##           196 
## 
## Column: marital 
## 
## divorced  married   single 
##      892     5008     2338 
## 
## Column: education 
## 
##            basic.4y            basic.6y            basic.9y         high.school 
##                 817                 486                1243                1915 
##          illiterate professional.course   university.degree 
##                   6                1043                2728 
## 
## Column: default 
## 
##      no unknown 
##    6491    1747 
## 
## Column: housing 
## 
##   no  yes 
## 3749 4489 
## 
## Column: loan 
## 
##   no  yes 
## 6946 1292 
## 
## Column: contact 
## 
##  cellular telephone 
##      5232      3006 
## 
## Column: month 
## 
##  apr  aug  dec  jul  jun  mar  may  nov  oct  sep 
##  529 1189   38 1481 1074  105 2729  827  149  117 
## 
## Column: day_of_week 
## 
##  fri  mon  thu  tue  wed 
## 1524 1676 1769 1599 1670 
## 
## Column: poutcome 
## 
##     failure nonexistent     success 
##         836        7136         266 
## 
## Column: y 
## 
##   no  yes 
## 7329  909

     If we check again for unknown in job, marital, education, housing, loan or ” ” in job it will no longer exist because it has been replaced with the mode of each respective columns.

1c. Summary statistics and initial exploration

Summary Statistics for Numerical Variables
Variable Mean Median SD IQR Min Max
age 39.93 38 10.51 15 17 89
duration 258.07 180 266.56 217 4 4918
campaign 2.57 2 2.79 2 1 43
pdays 963.90 999 183.41 0 0 999
previous 0.17 0 0.49 0 0 6
     Above is the central tendency and variability for numerical columns. The numbers for pdays are a little bit confusing because the median and max has the same value, because like what we analyze before that pdays are filled with 999 as the value.
Age Distribution by Job Type
job Count Mean_Age Median_Age SD_Age Min_Age Max_Age
admin. 2106 38.4 36 9.1 21 70
blue-collar 1870 39.3 38 9.0 20 80
technician 1372 38.4 36 8.5 22 70
services 822 37.6 36 9.3 20 60
management 546 42.4 41 9.3 23 77
retired 366 61.8 59 10.5 26 89
entrepreneur 294 40.7 39 9.0 20 60
self-employed 263 39.9 39 9.4 23 60
housemaid 219 45.8 45 10.6 23 82
unemployed 196 39.8 39 8.9 23 66
student 184 26.0 26 4.9 17 44
     Above is the summary table of age by job type with admin as the majority and retired as minority. The youngest in age is 20 which is in the blue-collar and services job, also the oldest in age is 89 in retired category
Campaign Outcome Summary
y Count Percentage Avg_Age Avg_Duration Avg_Campaign
no 7329 89 39.8 221.5 2.6
yes 909 11 40.8 552.8 2.0

     Above is the summary table of campaign outcome distribution, most of the campaign didn’t come to a success since a lot of the clients rejected or doesn’t subscribe to a term deposit. The one that rejects has the average age of ~40 and it took an average of 3 contacts to make them reject and those who accepted has the average age of ~41 and it took them an average of 2 contacts to be convinced.

Question 2: Relationship Analysis

2a. Explore variable relationships

Correlation of Numeric Variables with Campaign Outcome
Variable Pearson Spearman
age 0.029 -0.015
duration 0.389 0.339
campaign -0.069 -0.066
previous 0.219 0.184
     After checking the correlation score between numeric variables and campaign outcome using pearson and spearman method, we can see that the highest correlation that a numeric variable has with campaign outcome is on the duration column, so we can go ahead and do a grouped summary of those 2 columns.
Call Duration by Campaign Outcome
y Count Mean_Duration Median_Duration SD_Duration Min_Duration Max_Duration
no 7329 221.5 165 210.8 4 4918
yes 909 552.8 439 433.7 63 4199
     Clients mostly rejected the offer because the count for ‘no’ is 7329 and ‘yes’ is only 909, clients who accepted the term deposit or answered ‘yes’ had longer average contact duration in the value 552.8 seconds compared to those who rejected or answered ‘no’ the average contact duration is 221.5 seconds. This suggests that longer conversations are more likely to result in successful outcomes.
Campaign Success Rate by Job Type
job no yes Total SuccessRate
student 130 54 184 29.3
retired 276 90 366 24.6
unemployed 171 25 196 12.8
admin. 1844 262 2106 12.4
self-employed 234 29 263 11.0
technician 1222 150 1372 10.9
management 488 58 546 10.6
housemaid 196 23 219 10.5
services 749 73 822 8.9
entrepreneur 268 26 294 8.8
blue-collar 1751 119 1870 6.4
     The largest job group and says yes the most is admin but however it doesn’t have the biggest success rate the success rate for admin is only 12.4%. The job categories with the highest success rate is student with the rate of 29.3% success rate. Also, blue collar has the lowest success rate maybe due to it being the 2nd largest group but only 6.4% said yes.
Current Campaign Success by Previous Campaign Outcome
poutcome no yes Total Success_Rate
success 98 168 266 63.2
failure 728 108 836 12.9
nonexistent 6503 633 7136 8.9

     This table shows how past campaign results can actually predict current success. Clients with previous successful campaigns have a 63.2% success rate in the current campaign, which is dramatically higher than those with previous outcome being failures (has a 12.9% success rate) or no previous contact has ever been made before and this is their first campaign with the success rate of 8.9%. This indicates that clients who responded positively before are much more likely to respond positively again, suggesting strong customer loyalty and satisfaction with the bank’s services. The pattern indicates that it is important to stay in good terms with previously successfully convinced client.

2b. Feature engineering

Distribution of Age
Age_Dist Count
<30 1187
>50 1443
30-50 5608
     The age group distribution shows that the 30-50 age group is the largest group with 5,608 clients, followed by clients over 50 years old which has the value of 1,443 clients, and clients under 30 years old with 1,187 clients. This indicates that the bank’s clients are dominated by people who is in their 30 - 50s.
Distribution of Recency Level
Recency_Dist Count
never 7947
old 2
recent 289
     The recency level distribution shows that the majority of client with the value of 7,947 clients have never been contacted ever, and tehre are 289 clients were contacted in the last 3 weeks and only 2 clients were contacted in more than 3 weeks ago. This means that most clients in this campaign are being approached for the first time.
Distribution of Total Contact Score
Total_Contact_Score Count
1 3038
2 2188
3 1285
4 633
5 351
6 226
7 155
8 75
9 55
10 55
11 35
12 19
13 17
14 17
15 12
16 13
17 15
18 11
19 2
20 4
21 6
22 3
23 1
24 2
25 2
26 2
27 4
29 3
30 2
31 3
33 1
35 2
43 1

     The total contact score distribution shows that most clients, 3.038 clients in total, have a score of 1, meaning they were contacted only once across all campaigns. The frequency decreases as the contact score increases, with 2.188 clients having a score of 2 contacts, and 1.285 clients having 3 contacts made. Very few clients have high contact value, with only 1 client having the maximum value of contact made that is on 43 contacts. This indicates that the bank typically makes minimal contact attempts with most clients, and follow-up contact is rarely done.

2c. Interactive visualizations of relationships

Between 2 categorical variables

     For interactive visualization between 2 categorical variable, I chose to use the same columns as I did for the variable correlation analysis in 2a, which are job and y. Here the bar chart compares job types with campaign outcomes, sorted by total client count from lowest to highest (so the chart is prettier and easier to read). Admin workers show the highest numbers for both ‘yes’ and ‘no’. Students and retired clients, even when they have smaller sample sizes, appear to have relatively higher acceptance rates compared to other columns, retired clients have 90 ‘yes’ which is on the 4th place if we sort the frequency of ‘yes’ from highest to lowest, and students are in 7th place with n=54 which means the count for ‘yes’ is 54. The visualization clearly shows that while admin and blue-collar workers make up the majority of contacts, their success rates are low compared to students. Like we analyzed before in correlation, student has 29.3% success rate while admin has 12.4% success rate and blue-collar has 6.4% success rate.

Between numerical and categorical variables

     For the categorical and numerical variable visualization, I chose y and age. This interactive boxplot shows the age distribution for each current campaign outcome, which is yes and no. The successful and unsuccessful campaign outcome show similar median ages around late 30 age, those who said ‘no’ has the median of 38 years old and those who says ‘yes’ has the median of 36 years old, with the successful group having a slightly lower median age. The interquartile ranges are also quite similar between the two groups, suggesting that age alone is not be a strong predictor of whether the campaign is successful or not. But, the successful group shows a slightly better distribution with fewer outliers.

Multi dimensional

     This multi-dimensional bar plot shows current campaign success rates grouped by age group column that we made in 2b, also in each of the age group it has more category inside which is the education level column again by education level. The visualization reveals how education level influences campaign success within different age groups. Higher education levels, particularly university degrees, professional courses and high school, consistently show better success rates across all age groups, with the youngest group (under 30 years old) achieving the highest success rates among all age group with the success rate of their university degree clients. The middle-aged group (30-50 years old) shows okay success rates with university degree and it has the highest success rate among the age group but the university degree on this age group is the lowest among other age group, while the oldest group (above 50 years old) demonstrates the a moderate success rates across all education levels, with the highest success rate is with clients with a basic 4y education and the lowest success rate is clients with basic 9y education level. This pattern means that both younger age and higher education can impact campaign acceptance significantly/positively, though the >50 age group shows a different education preference pattern compared to younger groups with the highest success rate is those who have basic 4y education level, because maybe the era is different too in terms of education.

Question 3: Make business recommendations

Chart 1

3a. Data storytelling through visual design

3b. Interpret your findings

     This chart reveals that campaign success rates will increase with longer call duration, because the lowest success rate is 0.7% and it’s for very short calls under 100 seconds then it grew to a success rate of 47.3% for calls lasting more than 600 seconds. Even the difference between medium and long duration is pretty significant with almost 30%. So the possible correlation here is because genuinely engaged customers will naturally comply to participate in longer discussions and that is where bank representatives can properly explain benefits and address concerns of the campaign they are doing currently.

3c. Make business recommendations

     First, implement minimum call duration standards of 300 seconds (5 minutes), as calls under this time cap have less than 8% success probability and this will mean the effort that the bank representative has done is wasted. Second, teach agents to slow down and have real conversations with customers instead of rushing through calls, because taking more time leads to much better results. Bank representatives then can explain benefits clearly, and take time to answer questions, since our data proves that longer conversations of 10+ minutes are much more successful than rushing through quick calls.

Chart 2

3a. Data storytelling through visual design

3b. Interpret your findings

     This chart reveals that customer loyalty from previous campaigns is the strongest predictor of current campaign success, showing a dramatic 7x difference between satisfied and new customers. First-time contacts achieve only 8.9% success rate, previous failures show slight improvement at 12.9%, but customers with previous successful experiences shows loyalty with the bank with 63.2% success rate. The massive gap between previous success (63.2%) and first-time contacts (8.9%) and previous failure (12.9%) indicates that satisfied customers are willing to engage in future offerings. In this case, the clients who previously accepted the campaign are more likely to accept the current campaign because maybe the bank really does give a good benefit for it. This pattern suggests that maintaining good relationships with existing successful customers is effectively impacting the future campaign outcomes too, this might be better than searching new prospects.

3c. Make business recommendations

     First, prioritize campaigns targeting customers with previous successful outcomes, since they have a 63.2% success rate compared to only 8.9% for new contacts, this means focusing on satisfied customers is way more effective than contacting new prospects. Second, create a customer loyalty program to keep successful customers engaged between campaigns, because the data proves that once customers trust the bank and have positive experiences, they become highly likely to accept future offers. Bank representatives should maintain regular contact with these valuable customers through follow-up calls or special offers to make them want to remain there being satisfied and ready for the next campaign.